import duckdb
from pathlib import Path
from IPython.display import display
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as pxHospitalizations and costs of ambulatory care sensitive conditions in Brazil
Exploratory and Data Analysis
Introduction
About the data
Ambulatory care sensitive conditions refer to health issues that can be effectively prevented, detected early, treated, and managed through primary health care (APS)1. In Brazil, the Ministry of Health developed its own list of hospitalizations related to these conditions, known as Hospitalizations for Primary Care Sensitive Conditions (ICSAP)2,3.High rates of ICSAP admissions often indicate serious issues with health system access or performance, reflecting potential gaps in preventive care, early diagnosis, or continuity of treatment1. Beyond the clinical implications, these hospitalizations carry a significant economic burden, with avoidable admissions representing billions in health care expenditures.
Objective
This exploratory data analysis aims to quantify the clinical and financial impact of ICSAP in the Brazilian SUS from 2014 to 2024, while assessing data quality and identifying inconsistencies to ensure robust insights on primary care gaps and potential cost savings.
Methodology
Hospital Information System (SIH), is an administrative database managed by the Ministry of Health, which also provides the official ICSAP list3. Macro-region data are sourced from the Brazilian Institute of Geography and Statistics (IBGE).
When processing the SIH-SUS data into the silver layer, hospitalizations whose primary diagnosis was related to obstetric causes or childbirth (ICD-10 codes O80–O84) were excluded4.
This document presents descriptive analyses of the percentage of ICSAP admissions and their costs, broken down annually by age group, gender, and Brazilian macro-regions. Results are initially presented for the period 2014–2024, followed by a more detailed look at data from 2024. Additionally, the most frequent ICSAP admissions are analyzed according to their primary diagnoses.
It is important to note that the values presented here may differ from other publications due to specific methodological choices made in this study
Data contains
The database consists of multiple datasets. Here will be explored data from the sih_sus table.
- id_hospitalization: Id number of the hospitalization.
- year: Year of hospitalization (yyyy).
- month: Month of the hospitalization (mm).
- state_code: State code of healthcare provider.
- length_stay: Number of days spent in hospital.
- gender: Subject’s gender.
- age: Subject’s age (years).
- race_color_code: Code for the Race/color of the subject.
- disease_code: Code of the principal diagnostic based in the ICD-10.
- hospitalization_type_code: Code of the type of hospitalization.
- complexity_code: Code for the hospitalization complexity.
- bed_speciality_code: Code of the hospitalization bed speciality.
- requested_procedure_code: Code for the requested procedure.
- total_cost: Total costs of the hospitalization (BR R$).
- death_flag: Indicate subject death.
Getting started
The first step is to import all the libraries required for the analysis.
Next, we establish a connection with the database, as the data is stored in a DuckDB file.
# Set the file path
base_dir = Path.cwd().parent.parent
duckdb_file = base_dir / "sih_sus.duckdb"
# Connect with the database
con = duckdb.connect(str(duckdb_file))Then, inspect the dimmension of the sih_sus dataset, reviewing the first ten rows, column names and types, basic calculations and presence of null vales.
sih_10 = con.sql("""
SELECT * FROM silver.sih_sus LIMIT 10
""").df()
sih_info = con.sql("""
DESCRIBE silver.sih_sus
""").df()
sih_summa = con.sql("""
SUMMARIZE silver.sih_sus
""").df()
print("First ten roles of the SIH SUS dataset:")
display(sih_10)
print("-------------------------------------------------------------")
print("Table atributes:")
display(sih_info)
print("-------------------------------------------------------------")
print("SIH SUS data summarization:")
display(sih_summa)First ten roles of the SIH SUS dataset:
| id_hospitalization | year | month | state_code | age | gender | race_color_code | disease_code | hospitalization_type_code | complexity_code | bed_speciality_code | requested_procedure_code | length_stay | total_cost | death_flag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3513121070635 | 2014 | 2 | 35 | 64 | Male | 1 | I500 | 2 | 2 | 03 | 0303060212 | 13 | 1260.71 | 0 |
| 1 | 3513121073913 | 2014 | 2 | 35 | 60 | Female | 1 | I802 | 1 | 2 | 03 | 0303060298 | 17 | 548.97 | 0 |
| 2 | 3513121073968 | 2014 | 2 | 35 | 46 | Male | 1 | C910 | 2 | 2 | 03 | 0304100021 | 7 | 587.90 | 1 |
| 3 | 3513121074012 | 2014 | 2 | 35 | 39 | Male | 1 | J690 | 2 | 2 | 03 | 0303140151 | 8 | 1636.74 | 0 |
| 4 | 3513121074034 | 2014 | 2 | 35 | 75 | Female | 1 | I615 | 2 | 2 | 03 | 0303040149 | 16 | 2126.65 | 1 |
| 5 | 3513121074045 | 2014 | 2 | 35 | 48 | Female | 1 | T838 | 1 | 3 | 03 | 0506020045 | 1 | 138.36 | 0 |
| 6 | 3513121074067 | 2014 | 2 | 35 | 48 | Female | 1 | N390 | 2 | 2 | 03 | 0303150033 | 0 | 298.21 | 0 |
| 7 | 3513121074089 | 2014 | 2 | 35 | 30 | Male | 1 | J690 | 2 | 2 | 03 | 0303140151 | 23 | 7591.28 | 0 |
| 8 | 3513121074090 | 2014 | 2 | 35 | 48 | Female | 1 | T838 | 2 | 3 | 03 | 0506020045 | 7 | 732.96 | 0 |
| 9 | 3513125119735 | 2014 | 2 | 35 | 61 | Female | 1 | A499 | 1 | 2 | 03 | 0303010037 | 10 | 865.91 | 0 |
-------------------------------------------------------------
Table atributes:
| column_name | column_type | null | key | default | extra | |
|---|---|---|---|---|---|---|
| 0 | id_hospitalization | VARCHAR | YES | None | None | None |
| 1 | year | SMALLINT | YES | None | None | None |
| 2 | month | SMALLINT | YES | None | None | None |
| 3 | state_code | SMALLINT | YES | None | None | None |
| 4 | age | SMALLINT | YES | None | None | None |
| 5 | gender | VARCHAR | YES | None | None | None |
| 6 | race_color_code | SMALLINT | YES | None | None | None |
| 7 | disease_code | VARCHAR | YES | None | None | None |
| 8 | hospitalization_type_code | SMALLINT | YES | None | None | None |
| 9 | complexity_code | SMALLINT | YES | None | None | None |
| 10 | bed_speciality_code | VARCHAR | YES | None | None | None |
| 11 | requested_procedure_code | VARCHAR | YES | None | None | None |
| 12 | length_stay | SMALLINT | YES | None | None | None |
| 13 | total_cost | DECIMAL(14,2) | YES | None | None | None |
| 14 | death_flag | INTEGER | YES | None | None | None |
-------------------------------------------------------------
SIH SUS data summarization:
| column_name | column_type | min | max | approx_unique | avg | std | q25 | q50 | q75 | count | null_percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id_hospitalization | VARCHAR | 1113100216734 | 9924300005444 | 94038206 | None | None | None | None | None | 120119838 | 0.0 |
| 1 | year | SMALLINT | 2014 | 2024 | 12 | 2019.1870472885587 | 3.2104981687250445 | 2016 | 2019 | 2022 | 120119838 | 0.0 |
| 2 | month | SMALLINT | 1 | 12 | 13 | 6.534843337034803 | 3.418173041111256 | 4 | 7 | 9 | 120119838 | 0.0 |
| 3 | state_code | SMALLINT | 11 | 53 | 21 | 32.774509277976215 | 9.441504986848388 | 27 | 33 | 41 | 120119838 | 0.0 |
| 4 | age | SMALLINT | 0 | 99 | 98 | 42.920546196540826 | 24.88953097272519 | 23 | 43 | 63 | 120119838 | 0.0 |
| 5 | gender | VARCHAR | Female | Male | 2 | None | None | None | None | None | 120119838 | 0.0 |
| 6 | race_color_code | SMALLINT | 1 | 99 | 6 | 20.721369595478556 | 38.16587152644139 | 1 | 3 | 3 | 120119838 | 0.0 |
| 7 | disease_code | VARCHAR | A00 | Z999 | 11377 | None | None | None | None | None | 120119838 | 0.0 |
| 8 | hospitalization_type_code | SMALLINT | 1 | 6 | 6 | 1.799002376276931 | 0.5582217046601471 | 2 | 2 | 2 | 120119838 | 0.0 |
| 9 | complexity_code | SMALLINT | 2 | 3 | 2 | 2.0828147470528555 | 0.2756020053551203 | 2 | 2 | 2 | 120119838 | 0.0 |
| 10 | bed_speciality_code | VARCHAR | 01 | Saúde Mental (Clínico) | 19 | None | None | None | None | None | 120119838 | 0.0 |
| 11 | requested_procedure_code | VARCHAR | 0201010038 | 0506020134 | 1685 | None | None | None | None | None | 120119838 | 0.0 |
| 12 | length_stay | SMALLINT | 0 | 364 | 356 | 5.65983494749635 | 8.111488519371962 | 2 | 3 | 6 | 120119838 | 0.0 |
| 13 | total_cost | DECIMAL(14,2) | 0.00 | 274207.48 | 2315519 | 1513.1952969215627 | 3805.741299234852 | 322 | 607 | 1135 | 120119838 | 0.0 |
| 14 | death_flag | INTEGER | 0 | 1 | 2 | 0.051493725790739076 | 0.2210025393554081 | 0 | 0 | 0 | 120119838 | 0.0 |
The initial exploration reveals some key characteristcs of the dataset. SIH-SUS dataset contains 120,119,838 records across eleven years but with 26,081,632 non unique values for id_hospitalization code. This need to be checked!
Min and Max values for the year, lenght_stay and age columns suggests that SIH-SUS covers data from 2014 to 2024, that some individual were admitted and discharged at the same day (0 values) and that the data covers a range (0 to 99) of individuals from newborn to elderly. Also, the null_percentage is 0.0 for all columns. The maximum value of 364 days spent shows that there are very long-term hospitalizations in the sih_sus dataset, which is expected but worth verifying.
The Min value of 0 for total_cost is interesting but may not be an error because some scenarios can be recorded as having zero cost, as such:
- If the main payment is registered on the first AIH, subsequent monthly records might appear with a cost of zero.
- When a patient is transferred between hospitals
- When certain procedures or administrative adjustments within the system might not have a direct cost associated with them but still generate a record.
Let’s have a look into some duplicated id_hospitalization
some_duplicates = con.sql("""
SELECT
id_hospitalization,
COUNT(*) AS total_duplicates
FROM silver.sih_sus
WHERE year = 2014 AND state_code = 35
GROUP BY ALL
HAVING COUNT(*) > 1
LIMIT 10
""").df()
some_duplicates| id_hospitalization | total_duplicates | |
|---|---|---|
| 0 | 3513121299479 | 4 |
| 1 | 3514100438156 | 6 |
| 2 | 3514100434273 | 7 |
| 3 | 3514100430918 | 4 |
| 4 | 3514105710522 | 9 |
| 5 | 3513116876159 | 4 |
| 6 | 3512118708549 | 12 |
| 7 | 3513120582587 | 12 |
| 8 | 3510110855735 | 12 |
| 9 | 3508102554726 | 12 |
The ids vary in number of repetitions. We need to be sure they are not just an error. So let’s search for duplicates in Rio de Janeiro state at 2023, as an example, based on the id_hospitalization, month, age, gender, disease_code, length_stay and total_cost.
sih_dups = con.sql("""
SELECT
id_hospitalization,
month,
age,
gender,
disease_code,
length_stay,
total_cost
FROM silver.sih_sus
WHERE year = 2023 AND state_code = 33
QUALIFY COUNT(*) OVER (PARTITION BY id_hospitalization) > 1
ORDER BY id_hospitalization, age DESC, total_cost DESC
LIMIT 20
""").df()
sih_dups| id_hospitalization | month | age | gender | disease_code | length_stay | total_cost | |
|---|---|---|---|---|---|---|---|
| 0 | 3308100738316 | 5 | 57 | Female | F200 | 31 | 1196.35 |
| 1 | 3308100738316 | 1 | 57 | Female | F200 | 31 | 1196.35 |
| 2 | 3308100738316 | 7 | 57 | Female | F200 | 31 | 1196.35 |
| 3 | 3308100738316 | 3 | 57 | Female | F200 | 31 | 1196.35 |
| 4 | 3308100738316 | 6 | 57 | Female | F200 | 30 | 1157.76 |
| 5 | 3308100738316 | 4 | 57 | Female | F200 | 30 | 1157.76 |
| 6 | 3308100738316 | 2 | 57 | Female | F200 | 28 | 1080.58 |
| 7 | 3308100738350 | 1 | 66 | Female | F205 | 31 | 1196.35 |
| 8 | 3308100738350 | 7 | 66 | Female | F205 | 31 | 1196.35 |
| 9 | 3308100738350 | 5 | 66 | Female | F205 | 31 | 1196.35 |
| 10 | 3308100738350 | 3 | 66 | Female | F205 | 31 | 1196.35 |
| 11 | 3308100738350 | 4 | 66 | Female | F205 | 30 | 1157.76 |
| 12 | 3308100738350 | 6 | 66 | Female | F205 | 30 | 1157.76 |
| 13 | 3308100738350 | 2 | 66 | Female | F205 | 28 | 1080.58 |
| 14 | 3308100738437 | 7 | 59 | Male | F205 | 31 | 1196.35 |
| 15 | 3308100738437 | 10 | 59 | Male | F205 | 31 | 1196.35 |
| 16 | 3308100738437 | 12 | 59 | Male | F205 | 31 | 1196.35 |
| 17 | 3308100738437 | 8 | 59 | Male | F205 | 31 | 1196.35 |
| 18 | 3308100738437 | 6 | 59 | Male | F205 | 30 | 1157.76 |
| 19 | 3308100738437 | 9 | 59 | Male | F205 | 30 | 1157.76 |
Apparently is not an error because they happen all in different months. We can look if the same id_hospitalization repeats over the years and if it has the same pattern.
sih_dups_years = con.sql("""
SELECT
id_hospitalization,
year,
month,
age,
gender,
disease_code,
length_stay,
total_cost
FROM silver.sih_sus
WHERE state_code = 33 AND id_hospitalization = '3308100738350'
ORDER BY id_hospitalization, year, month
""").df()
display(sih_dups_years.head(20))
sih_dups_years['year'].value_counts().sort_index()| id_hospitalization | year | month | age | gender | disease_code | length_stay | total_cost | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3308100738350 | 2014 | 1 | 57 | Female | F205 | 31 | 1196.35 |
| 1 | 3308100738350 | 2014 | 3 | 57 | Female | F205 | 31 | 1196.35 |
| 2 | 3308100738350 | 2014 | 4 | 57 | Female | F205 | 30 | 1157.76 |
| 3 | 3308100738350 | 2014 | 5 | 57 | Female | F205 | 31 | 1196.35 |
| 4 | 3308100738350 | 2014 | 6 | 57 | Female | F205 | 30 | 1157.76 |
| 5 | 3308100738350 | 2014 | 7 | 57 | Female | F205 | 31 | 1196.35 |
| 6 | 3308100738350 | 2015 | 3 | 58 | Female | F205 | 31 | 1157.76 |
| 7 | 3308100738350 | 2015 | 4 | 58 | Female | F205 | 30 | 1157.76 |
| 8 | 3308100738350 | 2015 | 5 | 58 | Female | F205 | 31 | 1196.35 |
| 9 | 3308100738350 | 2015 | 6 | 58 | Female | F205 | 30 | 1157.76 |
| 10 | 3308100738350 | 2015 | 7 | 58 | Female | F205 | 31 | 1157.76 |
| 11 | 3308100738350 | 2015 | 8 | 58 | Female | F205 | 31 | 1196.35 |
| 12 | 3308100738350 | 2015 | 10 | 58 | Female | F205 | 31 | 1196.35 |
| 13 | 3308100738350 | 2015 | 11 | 59 | Female | F205 | 30 | 1157.76 |
| 14 | 3308100738350 | 2015 | 12 | 59 | Female | F205 | 31 | 1196.35 |
| 15 | 3308100738350 | 2016 | 3 | 59 | Female | F205 | 31 | 1196.35 |
| 16 | 3308100738350 | 2016 | 4 | 59 | Female | F205 | 30 | 1157.76 |
| 17 | 3308100738350 | 2016 | 5 | 59 | Female | F205 | 31 | 1196.35 |
| 18 | 3308100738350 | 2016 | 6 | 59 | Female | F205 | 30 | 1157.76 |
| 19 | 3308100738350 | 2016 | 7 | 59 | Female | F205 | 31 | 1196.35 |
year
2014 6
2015 9
2016 11
2017 13
2018 12
2019 12
2020 12
2021 12
2022 12
2023 7
Name: count, dtype: int64
It looks like the same patern repeat over the years, therefore can be a continuos treatment/hospitalization.
But, if more than one identical record per month happens it is an error! Let’s check.
some_real_dups = con.sql("""
SELECT
id_hospitalization,
month,
disease_code,
length_stay,
total_cost,
COUNT(*) AS total_duplicates
FROM silver.sih_sus
WHERE year = 2023 AND state_code = 33
GROUP BY ALL
HAVING COUNT(*) > 1
ORDER BY month
LIMIT 10
""").df()
some_real_dups| id_hospitalization | month | disease_code | length_stay | total_cost | total_duplicates | |
|---|---|---|---|---|---|---|
| 0 | 3323101961696 | 4 | F200 | 28 | 1341.93 | 2 |
| 1 | 3323107430852 | 8 | F312 | 31 | 2774.79 | 2 |
| 2 | 3323103325949 | 8 | F70 | 31 | 2554.39 | 2 |
| 3 | 3321102499638 | 8 | F200 | 31 | 2554.39 | 2 |
| 4 | 3323105680037 | 8 | F200 | 31 | 1491.04 | 2 |
| 5 | 3323107430885 | 8 | F200 | 31 | 2774.79 | 2 |
| 6 | 3323105067766 | 8 | F200 | 31 | 2554.39 | 2 |
| 7 | 3322109138117 | 8 | F200 | 31 | 2554.39 | 2 |
| 8 | 3323103424839 | 8 | F29 | 31 | 2554.39 | 2 |
| 9 | 3322109100740 | 8 | F21 | 31 | 2554.39 | 2 |
Yes, there are some identical records per month, so they need to be filtered out before the analysis.
Now let’s have a look for the number of records with a length of 364 days and the ones with zero cost
sih_364_stay = con.sql("""
WITH unique_obs AS (
SELECT DISTINCT
id_hospitalization,
gender,
age,
state_code,
month,
disease_code,
length_stay,
total_cost
FROM silver.sih_sus
)
SELECT *
FROM unique_obs
WHERE length_stay = 364;
""").df()
sih_zero_cost = con.sql("""
WITH unique_obs AS (
SELECT DISTINCT
id_hospitalization,
year,
month,
disease_code,
length_stay,
total_cost
FROM silver.sih_sus
)
SELECT *
FROM unique_obs
WHERE total_cost = 0
""").df()
print(f"There are {sih_364_stay.shape[0]} observations recorded with 364 days of stay.")
print(f"There are {sih_zero_cost.shape[0]} observations recorded with zero cost.")
display(sih_364_stay)There are 12 observations recorded with 364 days of stay.
There are 280156 observations recorded with zero cost.
| id_hospitalization | gender | age | state_code | month | disease_code | length_stay | total_cost | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3316104693143 | Male | 10 | 33 | 11 | J849 | 364 | 14935.84 |
| 1 | 3318100547586 | Female | 55 | 33 | 8 | K808 | 364 | 695.77 |
| 2 | 3319109926638 | Female | 1 | 33 | 8 | Q038 | 364 | 8762.22 |
| 3 | 3314103512676 | Male | 7 | 33 | 1 | A419 | 364 | 4781.71 |
| 4 | 2322101258829 | Male | 30 | 23 | 3 | K928 | 364 | 347.15 |
| 5 | 3316101141441 | Male | 5 | 33 | 1 | N47 | 364 | 219.12 |
| 6 | 3321108502350 | Male | 2 | 33 | 2 | A415 | 364 | 5435.85 |
| 7 | 3319109861463 | Male | 1 | 33 | 12 | Q743 | 364 | 3632.71 |
| 8 | 3118128823865 | Female | 61 | 31 | 1 | A418 | 364 | 2166.80 |
| 9 | 3316106534906 | Male | 9 | 33 | 11 | P271 | 364 | 103868.75 |
| 10 | 3515119823435 | Male | 1 | 35 | 11 | Q210 | 364 | 22481.91 |
| 11 | 1123101678700 | Male | 4 | 11 | 2 | G120 | 364 | 6453.57 |
It seems that hospitalizations recorded with a length of 364 days of stay are valid, since they are distributed across different gender, ages, month and brazilian state.
However, there are many entries recorded with cost equal to zero. I do not have enough information about the system to know if these records are valid or an error or anything else, so I will filter them out for further analysis.
Create a table without duplicates into the database.
con.execute("""
CREATE OR REPLACE TABLE silver.sih_sus_eda_clean AS
SELECT DISTINCT
id_hospitalization,
year,
month,
state_code,
gender,
race_color_code,
age,
disease_code,
length_stay,
total_cost
FROM silver.sih_sus
WHERE total_cost > 0
""")Exploratory analysis
The next step is to explore and analyze ICSAP profiles and costs within SIH-SUS by join with the cid10_icsap table.
ICSAP Metrics By Year (2014 to 2024)
How have primary care-sensitive hospitalizations and costs evolved over the last ten years?
# Query the results
# Counting totals and hosp costs
sih_sus_year = con.sql("""
SELECT
year,
COUNT(*) AS total_hospitalization,
COUNT(icsap.avoidable_disease_code) AS total_icsap,
ROUND(COUNT(icsap.avoidable_disease_code) * 100.0 / COUNT(*), 2) AS rate_icsap,
SUM(CASE
WHEN icsap.avoidable_disease_code IS NOT NULL THEN total_cost
END) AS icsap_cost
FROM silver.sih_sus_eda_clean
LEFT JOIN silver.cid10_icsap AS icsap
ON disease_code = icsap.avoidable_disease_code
GROUP BY year
ORDER BY year
""").to_df()Code
# Currency formatter function
def format_brl(value):
if value >= 1e9:
return f"R${value / 1e9:.2f}B"
elif value >= 1e6:
return f"R${value / 1e6:.1f}M"
else:
return f"R${value:,.0f}"Code
# Create plot
sns.set_theme(style="white", palette="colorblind")
# Bar plot
fig, ax = plt.subplots()
sns.barplot(
data=sih_sus_year,
x="year", y="rate_icsap",
ax=ax
)
ax.bar_label(ax.containers[0], fmt='%.1f%%', fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'left']].set_visible(False)
ax.yaxis.set_visible(False)
ax.set_xlabel("")
plt.show()
# Horizontal bar plot
sih_sus_costs_year = sih_sus_year.sort_values(by='year', ascending=False).reset_index()
year_order = sih_sus_costs_year['year']
fig, ax = plt.subplots()
sns.barplot(
data=sih_sus_costs_year,
x="icsap_cost", y="year",
orient='h',
ax=ax, order=year_order
)
# Format values for bar labels
formatted_values = [format_brl(v) for v in sih_sus_costs_year["icsap_cost"]]
ax.bar_label(ax.containers[0], labels=formatted_values, padding=-50, fontsize=9, color='white', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()From 2014 to 2024, Brazil experienced a gradual decline in the proportion of avoidable hospitalizations (ICSAP) relative to total hospitalizations, dropping from 10.6% in 2014 to 8.4% in 2024. The absolute number of ICSAP cases remained relatively stable over the years.
In terms of financial impact, ICSAP-related costs increased significantly, reaching over R$1.2 billion in 2024.
ICSAP Metrics By year and Brazilian Macro-regions
To add a regional perspective, we join the dataset with the uf_localidade table, which contains information about Brazil’s macro-regions.
sih_sus_regional = con.sql("""
SELECT
sih.year,
uf.big_region_name,
COUNT(*) AS total_hospitalizations,
COUNT(icsap.avoidable_disease_code) AS total_icsap,
ROUND(COUNT(icsap.avoidable_disease_code) * 100.0 / COUNT(*), 2) AS rate_icsap,
ROUND(SUM(sih.total_cost)) AS total_cost,
ROUND(SUM(CASE
WHEN icsap.avoidable_disease_code IS NOT NULL THEN sih.total_cost
ELSE 0
END), 2) AS icsap_cost,
ROUND(SUM(CASE
WHEN icsap.avoidable_disease_code IS NOT NULL THEN sih.total_cost
ELSE 0
END) * 100.0 / SUM(sih.total_cost), 2) AS rate_icsap_cost
FROM silver.sih_sus_eda_clean AS sih
LEFT JOIN silver.cid10_icsap AS icsap
ON sih.disease_code = icsap.avoidable_disease_code
LEFT JOIN silver.uf_localidade AS uf
ON sih.state_code = uf.state_code
GROUP BY sih.year, uf.big_region_name
ORDER BY sih.year, uf.big_region_name
""").to_df()Code
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")
# Bar plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
data=sih_sus_regional,
x="year",
y="rate_icsap",
hue="big_region_name",
ax=ax
)
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("ICSAP (%)")
ax.set_xlabel("")
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
plt.tight_layout()
plt.show()
# Line Plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(
data=sih_sus_regional,
x="year", y="rate_icsap_cost",
hue="big_region_name",
marker="o",
ax=ax
)
ax.set_ylabel("ICSAP Cost (%)")
ax.set_xlabel("")
ax.set_xticks(sorted(sih_sus_regional["year"].unique()))
ax.set_xticklabels(sorted(sih_sus_regional["year"].unique()))
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
plt.tight_layout()
plt.show()The North consistently shows the highest ICSAP rates — over 10% annually — indicating structural challenges in primary care access. The Northeast follows closely, with high volumes and cost shares, reflecting its large population and uneven healthcare coverage. Meanwhile, the South and Southeast maintain lower ICSAP rates and cost proportions, though the Southeast still incurs the highest absolute costs due to its large population and hospitalization scale. The Centro-Oeste region remains intermediate in both metrics.
ICSAP Metrics by the Last Year (2024)
Focusing on 2024, we analyze ICSAP by region, age group, and gender.
con.sql("""
SELECT COUNT(*)
FROM silver.sih_sus_eda_clean AS sih
LEFT JOIN silver.cid10_icsap AS icsap
ON sih.disease_code = icsap.avoidable_disease_code
WHERE year = 2024 AND icsap.avoidable_disease_code IS NOT NULL
""").show()┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 1103170 │
└──────────────┘
As the dataset reduced to 1,103,170 rows (more than 99%) we will query the data into a pandas dataframe to continue with analysis.
# Query the data
sih_icsap_2024 = con.sql("""
SELECT
uf.big_region_name,
uf.state_name,
sih.length_stay,
sih.gender,
sih.race_color_code,
sih.age,
icsap.disease_category,
sih.total_cost
FROM silver.sih_sus_eda_clean AS sih
LEFT JOIN silver.cid10_icsap AS icsap
ON sih.disease_code = icsap.avoidable_disease_code
LEFT JOIN silver.uf_localidade AS uf
ON sih.state_code = uf.state_code
WHERE sih.year = 2024
AND sih.total_cost > 0
AND icsap.avoidable_disease_code IS NOT NULL
""").to_df()
# Data summary
def column_summary(df):
summary_data = []
for col_name in df.columns:
col_dtype = df[col_name].dtype
num_of_nulls = df[col_name].isnull().sum()
num_of_non_nulls = df[col_name].notnull().sum()
num_of_distinct_values = df[col_name].nunique()
summary_data.append({
'col_name': col_name,
'col_dtype': col_dtype,
'num_of_nulls': num_of_nulls,
'num_of_non_nulls': num_of_non_nulls,
'num_of_distinct_values': num_of_distinct_values
})
summary_df = pd.DataFrame(summary_data)
return summary_df
summary_df = column_summary(sih_icsap_2024)
print("Summary of the data:")
display(summary_df)
print("------------------------------------------------------------------------")
describe_df = sih_icsap_2024.describe()
print("Descriptive statistic of the data:")
display(describe_df)Summary of the data:
| col_name | col_dtype | num_of_nulls | num_of_non_nulls | num_of_distinct_values | |
|---|---|---|---|---|---|
| 0 | big_region_name | object | 0 | 1103170 | 5 |
| 1 | state_name | object | 0 | 1103170 | 27 |
| 2 | length_stay | int16 | 0 | 1103170 | 186 |
| 3 | gender | object | 0 | 1103170 | 2 |
| 4 | race_color_code | Int16 | 3 | 1103167 | 5 |
| 5 | age | int16 | 0 | 1103170 | 100 |
| 6 | disease_category | object | 0 | 1103170 | 19 |
| 7 | total_cost | float64 | 0 | 1103170 | 214100 |
------------------------------------------------------------------------
Descriptive statistic of the data:
| length_stay | race_color_code | age | total_cost | |
|---|---|---|---|---|
| count | 1.103170e+06 | 1103167.0 | 1.103170e+06 | 1.103170e+06 |
| mean | 6.293833e+00 | 2.321867 | 5.156380e+01 | 1.129597e+03 |
| std | 7.479414e+00 | 0.968601 | 2.800763e+01 | 2.678282e+03 |
| min | 0.000000e+00 | 1.0 | 0.000000e+00 | 2.198000e+01 |
| 25% | 2.000000e+00 | 1.0 | 2.900000e+01 | 3.280100e+02 |
| 50% | 4.000000e+00 | 3.0 | 5.900000e+01 | 5.207200e+02 |
| 75% | 7.000000e+00 | 3.0 | 7.400000e+01 | 8.659100e+02 |
| max | 3.430000e+02 | 5.0 | 9.900000e+01 | 1.679245e+05 |
Overall, the data reveals high heterogeneity in both cost and clinical profiles, with strong variability across hospitalizations.
The dataset has fill missing values (only 3 in race_color_code). It spans five macro-regions, 27 states, and includes 19 disease categories. The total_cost variable shows high variability, with a range from R$21.98 to R$167,924.50. The average hospitalization cost is approximately R$1,129.61, but the standard deviation is more than twice that (R$2,678.28), indicating a highly skewed distribution with probably substantial outliers.
Length_stay also exhibits variability, with a mean of 6 days and a standard deviation of 7 days. While most hospitalizations are short (median of 4 days), some extend up to 343 days, suggesting the presence of complex or chronic cases. The average age is 51 years, and the interquartile range (29 to 74) reflects a concentration of adult and elderly patients.
Thus, the statistic suggest that while most cases are relatively short and moderately costly, a subset of patients drives up the average.
So, we can analyse if this could be related due to expanded stays and or age group.
Total Hospitalizations and Costs
First, make a copy of the dataset and create an age group and race names.
# Make a copy of the data
df = sih_icsap_2024.copy()
#Create an age group
bins = [-1, 4, 14, 24, 44, 64, float('inf')]
labels = ['0-4', '5-14', '15-24', '25-44', '45-64', '65+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True)
#Map race names (IBGE Brazil)
race_names_map = {
1: "Branca",
2: "Preta",
3: "Amarela",
4: "Parda",
5: "Indígena"
}
df['race_color'] = df['race_color_code'].map(race_names_map)
df.head(10)| big_region_name | state_name | length_stay | gender | race_color_code | age | disease_category | total_cost | age_group | race_color | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sul | Paraná | 7 | Female | 1 | 48 | Doenças Cerebrovasculares | 1184.66 | 45-64 | Branca |
| 1 | Sul | Paraná | 5 | Female | 1 | 96 | Infecção no Rim e Trato Urinário | 1063.47 | 65+ | Branca |
| 2 | Sul | Paraná | 4 | Female | 3 | 54 | Doenças Cerebrovasculares | 2113.56 | 45-64 | Amarela |
| 3 | Nordeste | Pernambuco | 5 | Female | 2 | 55 | Doenças Cerebrovasculares | 3717.67 | 45-64 | Preta |
| 4 | Nordeste | Pernambuco | 11 | Female | 3 | 56 | Doenças Cerebrovasculares | 1170.17 | 45-64 | Amarela |
| 5 | Nordeste | Pernambuco | 3 | Female | 3 | 60 | Hipertensão | 264.32 | 45-64 | Amarela |
| 6 | Nordeste | Pernambuco | 4 | Male | 3 | 4 | Gastroenterites Infecciosas e complicações | 356.90 | 0-4 | Amarela |
| 7 | Nordeste | Pernambuco | 5 | Male | 3 | 51 | Úlcera gastrointestinal | 870.73 | 45-64 | Amarela |
| 8 | Nordeste | Pernambuco | 6 | Female | 3 | 66 | Diabetes melitus | 4137.38 | 65+ | Amarela |
| 9 | Nordeste | Pernambuco | 9 | Female | 3 | 80 | Infecção no Rim e Trato Urinário | 586.09 | 65+ | Amarela |
We can analyse the relationship between lenght of stay and the total cost and visualize the distribution of the age groups by doing a scatter plot.
# Get a sample of the data
sample_df = df.sample(n=100000, random_state=42)Code
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")
g = sns.relplot(
data=sample_df,
x='length_stay', y='total_cost',
col='age_group', hue='gender', col_wrap=3,
kind='scatter', height=4, aspect=1.2, alpha=0.5
)
sns.move_legend(
g, "upper center",
bbox_to_anchor=(.5, 1.0),
ncol=2, title=None, frameon=False
)
g.set_axis_labels("Length of Stay (Days)", "Total Cost (R$)")
g.set_titles("Age Group: {col_name}")
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
# We can filter the data to lenght stays equal or less then 100 days and
# costs equal or less then 60K for better visualization
filtered_df = sample_df[
(sample_df["length_stay"] <= 100) &
(sample_df["total_cost"] <= 60000)
]
g = sns.relplot(
data=filtered_df,
x='length_stay', y='total_cost',
col='age_group', hue='gender', col_wrap=3,
kind='scatter', height=4, aspect=1.2, alpha=0.5
)
sns.move_legend(
g, "upper center",
bbox_to_anchor=(.5, 1.0),
ncol=2, title=None, frameon=False
)
g.set_axis_labels("Length of Stay (Days)", "Total Cost (R$)")
g.set_titles("Age Group: {col_name}")
plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()There doesn’t appear to be a clear relationship between length of stay and total cost. We observe both short hospitalizations with high costs and long stays with either high or low costs.
Both age groups are represented across the spectrum, but individuals aged 65+ and 45–64 seem to have a higher concentration of cases with elevated costs and/or extended durations.
The presence of extreme values and wide interquartile ranges suggests that while most hospitalizations are moderately priced, a subset incurs significantly higher costs.
Now, we can analyse the statistic grouped by age group and gender to see the patterms.
summary_gender = df.groupby(["gender"])["total_cost"].describe().round(2)
summary_age = df.groupby(["age_group"])["total_cost"].describe().round(2)
display(summary_gender)
display(summary_age)| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| gender | ||||||||
| Female | 557906.0 | 1055.83 | 2548.52 | 21.98 | 317.64 | 487.21 | 805.72 | 136618.59 |
| Male | 545264.0 | 1205.07 | 2802.84 | 21.98 | 340.90 | 574.46 | 900.97 | 167924.52 |
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age_group | ||||||||
| 0-4 | 105421.0 | 781.47 | 1858.71 | 40.38 | 332.90 | 447.48 | 625.44 | 90287.23 |
| 5-14 | 90342.0 | 732.19 | 1633.27 | 40.38 | 332.90 | 410.85 | 622.42 | 74917.45 |
| 15-24 | 51335.0 | 813.00 | 2089.71 | 40.38 | 226.68 | 332.90 | 582.42 | 116257.54 |
| 25-44 | 132312.0 | 1077.99 | 3317.94 | 23.45 | 250.68 | 421.14 | 835.46 | 136618.59 |
| 45-64 | 263406.0 | 1317.56 | 2963.40 | 21.98 | 347.15 | 582.41 | 989.52 | 136999.63 |
| 65+ | 460354.0 | 1229.90 | 2664.24 | 21.98 | 355.15 | 587.07 | 927.57 | 167924.52 |
Code
# Counting the total of hospitalizations
# By gender
count_gender = df.groupby(["gender"]).size().reset_index(name='total_hospitalizations')
# By age group
count_age_group = df.groupby(["age_group"]).size().reset_index(name='total_hospitalizations')
# Create the plots
sns.set_theme(style="white", palette="colorblind")
# Bar plot for gender
fig, ax = plt.subplots()
sns.barplot(
data=count_gender,
x="gender", y="total_hospitalizations",
hue='gender',
ax=ax
)
for container in ax.containers:
ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("Total Hospitalizations")
ax.set_xlabel("")
plt.show()
# Bar plot for age group
fig, ax = plt.subplots()
sns.barplot(
data=count_age_group,
x="age_group", y="total_hospitalizations",
hue='age_group',
ax=ax
)
for container in ax.containers:
ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("Total Hospitalizations")
ax.set_xlabel("")
plt.show()Code
# Create the plots
sns.set_theme(style="whitegrid", palette="colorblind")
# Gender
fig, ax = plt.subplots()
sns.boxplot(data=sample_df, x='gender', y='total_cost', hue='gender')
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()
# Gender no outliers
fig, ax = plt.subplots()
sns.boxplot(
data=sample_df,
x='gender', y='total_cost', hue='gender',
showfliers=False # Hide outliers
)
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()
# Age group
fig, ax = plt.subplots()
sns.boxplot(data=sample_df, x='age_group', y='total_cost', hue='age_group')
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()
# Age group filtered
fig, ax = plt.subplots()
sns.boxplot(
data=sample_df,
x='age_group', y='total_cost', hue='age_group',
showfliers=False # Hide outliers
)
ax.set_ylabel("Total Cost (R$)")
ax.set_xlabel("")
plt.show()For gender, males have a higher average cost (R$1,205.07) compared to females (R$1,055.83), and also a higher standard deviation (R$2,802.84 vs. R$2,548.52), indicating greater dispersion in male hospitalization expenses. Both genders share the same minimum cost (R$21.98), but males reach a higher maximum (R$167,924.52), suggesting the presence of extreme outliers.
When analyzing age groups, hospitalization costs tend to increase with age. Children aged 0–4 and 5–14 have the lowest average costs, while adults aged 45–64 and seniors 65+ show the highest averages. The standard deviation also grows with age, reflecting more variability in older populations.
And for races? Is there a difference?
Code
# Counting the total of hospitalizations by race
count_race = (df.groupby(["race_color"])
.size()
.reset_index(name='total_hospitalizations')
.sort_values(by='total_hospitalizations', ascending=False)
)
# Bar plot
sns.set_theme(style="white", palette="colorblind")
fig, ax = plt.subplots()
sns.barplot(
data=count_race,
x="race_color", y="total_hospitalizations",
hue='race_color',
ax=ax
)
for container in ax.containers:
ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top']].set_visible(False)
ax.set_ylabel("Total Hospitalizations")
ax.set_xlabel("")
plt.show()We also can have a look in those patterms by macro-regions wiht the metrics agrouped by gender and age group.
# Prepare data
# Metrics agrouoed by gender
df_region_gender = (df.groupby(['big_region_name', 'gender'])
.size()
.reset_index(name='total_hospitalizations')
.sort_values(by='total_hospitalizations', ascending=False)
)
df_region_gender_cost = (df.groupby(['big_region_name', 'gender'])
.agg(total_cost=('total_cost', 'sum'))
.reset_index()
)
# Metrics agrouped by age group
df_region_age = (df.groupby(['big_region_name', 'age_group'])
.size()
.reset_index(name='total_hospitalizations')
)
df_region_age_cost = (df.groupby(['big_region_name', 'age_group'])
.agg(total_cost=('total_cost', 'sum'))
.reset_index()
)
# Metrics agrouped by race
df_region_race = (df.groupby(['big_region_name', 'race_color'])
.size()
.reset_index(name='total_hospitalizations')
)Code
# Define the order
region_order = df_region_gender['big_region_name']
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")
# For gender count
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
data=df_region_gender,
x='big_region_name', y='total_hospitalizations',
hue='gender',
ax=ax, order=region_order
)
for container in ax.containers:
ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'left']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=2)
ax.yaxis.set_visible(False)
ax.set_xlabel("")
plt.show()
# For gender cost
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
data=df_region_gender_cost,
x='big_region_name', y='total_cost', hue='gender',
ax=ax, order=region_order, width=0.9
)
for container in ax.containers:
values = container.datavalues
labels = [format_brl(v) for v in values]
ax.bar_label(container, labels=labels, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'left']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=2)
ax.set_ylabel("")
ax.yaxis.set_visible(False)
ax.set_xlabel("")
plt.show()Code
# Define the order
region_order = ['Sudeste', 'Nordeste', 'Sul', 'Norte', 'Centro-Oeste']
age_group_order = ['65+', '45-64', '25-44', '0-4', '5-14', '15-24']
# Create plot
sns.set_theme(style="white", palette="colorblind")
# For age group count
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(
data=df_region_age,
x='total_hospitalizations', y='big_region_name',
orient='h', hue='age_group', hue_order=age_group_order,
ax=ax, order=region_order
)
for container in ax.containers:
ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=6)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()
# For age group cost
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(
data=df_region_age_cost,
x='total_cost', y='big_region_name',
orient='h', hue='age_group', hue_order=age_group_order,
ax=ax, order=region_order
)
for container in ax.containers:
values = container.datavalues
labels = [format_brl(v) for v in values]
ax.bar_label(container, labels=labels, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=6)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()Code
# Define the order
region_order = ['Sudeste', 'Nordeste', 'Sul', 'Norte', 'Centro-Oeste']
race_group_order = df_region_race['total_hospitalizations']
# Create plot
sns.set_theme(style="white", palette="colorblind")
# For gender count
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(
data=df_region_race,
x='total_hospitalizations', y='big_region_name',
orient='h', hue='race_color',
ax=ax, order=region_order
)
for container in ax.containers:
ax.bar_label(container, fontsize=9, color='black', fontweight='bold')
ax.spines[['right', 'top', 'bottom']].set_visible(False)
ax.legend(title=None, loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=5)
ax.xaxis.set_visible(False)
ax.set_ylabel("")
plt.show()Across Brazil’s macro regions, hospitalization patterns reveal consistent gender and age-related trends. The Southeast region recorded the highest number of hospitalizations overall, with female patients slightly outnumbering males.
Older adults (65+) accounted for the largest share of hospitalizations in every region, particularly in the Southeast and Northeast. The 45–64 age group followed closely.
Total costs mirrored these patterns. The Southeast incurred the highest total hospitalization costs for both genders. When broken down by age group, the 65+ population consistently drove the highest costs across all regions, peaking in the Southeast at over R$259 million. The 45–64 group also contributed substantially, reinforcing the financial impact of aging populations on the healthcare system.
ICSAP Disease Categories
Let’s look at the avoidable disease categories, the main categories and their expenditures.
disease_category = (
df.groupby('disease_category')
.agg(total_hospitalizations=('disease_category', 'size'),
total_cost=('total_cost', 'sum'),
avg_cost=('total_cost', 'mean'),
avg_stay=('length_stay', 'mean'))
.sort_values(by='total_hospitalizations', ascending=False)
)
styled_tb = (
disease_category.style
.background_gradient(cmap='YlGnBu', high=0.2, subset=['total_cost', 'avg_cost', 'avg_stay'])
.format({'total_cost': 'R$ {:,.2f}'})
)
styled_tb| total_hospitalizations | total_cost | avg_cost | avg_stay | |
|---|---|---|---|---|
| disease_category | ||||
| Infecção no Rim e Trato Urinário | 218241 | R$ 126,877,175.19 | 581.362692 | 5.674319 |
| Doenças Cerebrovasculares | 197963 | R$ 365,759,200.23 | 1847.613949 | 7.290999 |
| Pneumonias bacterianas | 168655 | R$ 243,172,884.92 | 1441.836204 | 6.828223 |
| Diabetes melitus | 136172 | R$ 151,124,585.96 | 1109.806612 | 6.779984 |
| Gastroenterites Infecciosas e complicações | 126223 | R$ 51,377,873.76 | 407.040506 | 3.333283 |
| Infecção da pele e tecido subcutâneo | 61046 | R$ 52,273,362.37 | 856.294636 | 7.341382 |
| Úlcera gastrointestinal | 50844 | R$ 60,521,054.37 | 1190.328345 | 5.599599 |
| Hipertensão | 37068 | R$ 18,044,255.16 | 486.787935 | 4.222132 |
| Insuficiência Cardíaca | 26020 | R$ 71,171,979.81 | 2735.279777 | 8.264374 |
| Doenças preveníveis por imunização e condições sensíveis | 23839 | R$ 46,300,850.19 | 1942.231226 | 13.726541 |
| Doencas pulmonares | 22406 | R$ 20,922,481.99 | 933.789252 | 4.713470 |
| Deficiências Nutricionais | 19316 | R$ 24,227,126.98 | 1254.251759 | 8.666960 |
| Asma | 6492 | R$ 4,849,799.21 | 747.042392 | 3.428990 |
| Infecções de ouvido, nariz e garganta | 2239 | R$ 670,022.02 | 299.250567 | 2.887896 |
| Doenças relacionadas ao Pré-Natal e Parto | 1827 | R$ 1,143,375.44 | 625.821259 | 6.281883 |
| Eplepsias | 1485 | R$ 1,473,095.85 | 991.983737 | 5.622896 |
| Angina | 1132 | R$ 5,301,099.39 | 4682.949991 | 4.873675 |
| Anemia | 1121 | R$ 446,850.07 | 398.617368 | 3.992864 |
| Doença Inflamatória órgãos pélvicos femininos | 1081 | R$ 480,510.46 | 444.505513 | 2.449584 |
While urinary tract infections top the list in volume, cerebrovascular diseases dominate in total cost, exceeding R$365 million despite fewer admissions. Bacterial pneumonia and diabetes mellitus also represent high-burden conditions, combining frequent hospitalizations with substantial expenditures.
Interestingly, some conditions—such as heart failure and angina—stand out for their high cost per hospitalization. Heart failure, with just over 26,000 admissions, incurred roughly R$2,000 per hospitalization, while angina, with only 1,132 admissions, exceeded R$4,000 per hospitalization.
Immunization-preventable conditions still account for over 23,000 hospitalizations and R$46 million in costs — highlighting gaps in preventive coverage.
# Identify the 3 categories with the most hospitalizations
top3_categories = df['disease_category'].value_counts().nlargest(3).index.tolist()
# Filter the data
filter_top3 = df['disease_category'].isin(top3_categories)
df_top3 = df[filter_top3].copy()Code
# Create plot
sns.set_theme(style="whitegrid", palette="colorblind")
# For total cost
c = sns.catplot(
data=df_top3,
x='age_group', y='total_cost', hue='age_group',
col='disease_category',
kind='box', col_wrap=3,
showfliers=False # Hide outliers
)
c.set_axis_labels("", "Total Cost (R$)")
c.set_titles("{col_name}")
plt.show()
# For length stay
s = sns.catplot(
data=df_top3,
x='age_group', y='length_stay', hue='age_group',
col='disease_category',
kind='box', col_wrap=3,
showfliers=False # Hide outliers
)
s.set_axis_labels("", "Length of Stay (Days)")
s.set_titles("{col_name}")
plt.show()# Total hosp and cost by macro-regions
treemap_data = df.groupby(['big_region_name', 'disease_category']).agg(
total_hospitalizations=('disease_category', 'size'),
total_cost=('total_cost', 'sum')
).sort_values('total_hospitalizations', ascending=False).reset_index()
treemap_top5 = treemap_data.groupby('big_region_name').head(5)
# Get Brazil top disease category
top_category_brasil = df['disease_category'].value_counts().idxmax()
br_top_hosp = (df['disease_category'] == top_category_brasil).sum()
br_top_cost = df.loc[df['disease_category'] == top_category_brasil, 'total_cost'].sum()
root_label = f"Brazil - 2024"Code
# Create plot
# Treemap plot
fig_treemap = px.treemap(
treemap_top5,
path=[px.Constant(root_label), 'big_region_name', 'disease_category'],
values='total_hospitalizations',
color='total_cost',
color_continuous_scale='YlGnBu',
custom_data=['total_cost'],
labels={'total_cost': 'Total Cost'}
)
default_hover = (
"<b>%{label}</b><br>"
"Total Hospitalizations: %{value:,d}<br>"
"Total Cost: R$ %{customdata[0]:,.2f}<extra></extra>"
)
fig_treemap.update_traces(hovertemplate=default_hover)
labels = list(fig_treemap.data[0]['labels'])
root_idx = labels.index(root_label) # o índice do nó raiz
hovertemplates = [default_hover] * len(labels)
hovertemplates[root_idx] = (
f"<b>{root_label}</b><br>"
f"Brazil Top Disease Category: <b>{top_category_brasil}</b><br>"
f"Total Hospitalizations: {br_top_hosp:,d}<br>"
f"Total Cost: R$ {br_top_cost:,.2f}<extra></extra>"
)
fig_treemap.data[0].hovertemplate = hovertemplates
fig_treemap.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig_treemap.show()Conclusions
About the dataset:
Dataset size and structure: The SIH-SUS dataset is very large; therefore, it is more efficient to run queries that aggregate information before importing the data into Power BI.
Data quality issues:
- Missing data is minimal, but there are records with zero cost that need filtering for financial analyses.
- Both cost and length of stay show high variability and skewness, indicating that averages can be influenced by a small subset of complex cases.
- There are duplicated records by month that need to be handle before ML models.
About the data:
Gradual reduction in ICSAP: The proportion of hospitalizations for primary care–sensitive conditions in the SUS decreased between 2014 and 2024 (from 10.6% to 8.4%), suggesting some improvements, though the absolute number of cases remained stable.
Significant financial impact: Despite the proportional reduction, absolute costs increased, surpassing R$1.2 billion in 2024, indicating that remaining cases are potentially more complex and costly.
Marked regional differences:
- The North and Northeast maintain higher ICSAP rates, reflecting possible gaps in primary care access and coverage.
- The Southeast accounts for the highest absolute costs and hospitalizations due to its larger population.
Age and gender profiles: Older adults (45–64 and 65+) account for most hospitalizations and costs, reinforcing the impact of population aging.
Costs and length of stay: Maybe there is no direct relationship between days of hospitalization and cost, as expenses may depend on several factors beyond duration, including the type and complexity of treatment (e.g., surgeries, ICU), medications and supplies, comorbidities and others. Thus, short stays can be expensive, while some long stays may involve relatively low costs.
Main disease categories:
- Urinary tract infections lead in admission volume.
- Cerebrovascular diseases generate the highest total costs.
- Bacterial pneumonia, diabetes mellitus, and heart failure also have a significant impact.
- Vaccine-preventable conditions still account for over 23,000 admissions, highlighting gaps in preventive coverage.